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Abstract 

One of the most critical problems we face in the study of biological systems is building accu- 
rate statistical descriptions of them. This problem has been particularly challenging because 
biological systems typically contain large numbers of interacting elements, which precludes the 
use of standard brute force approaches. Recently, though, several groups have reported that 
there may be an alternate strategy. The reports show that reliable statistical models can be 
built without knowledge of all the interactions in a system; instead, pairwise interactions can 
suffice. These findings, however, are based on the analysis of small subsystems. Here we ask 
whether the observations will generalize to systems of realistic size, that is, whether pairwise 
models will provide reliable descriptions of true biological systems. Our results show that, in 
most cases, they will not. The reason is that there is a crossover in the predictive power of 
pairwise models: If the size of the subsystem is below the crossover point, then the results have 
no predictive power for large systems. If the size is above the crossover point, the results do have 
predictive power. This work thus provides a general framework for determining the extent to 
which pairwise models can be used to predict the behavior of whole biological systems. Applied 
to neural data, the size of most systems studied so far is below the crossover point. 



2 



1 Introduction 



Many important questions in biology are fundamentally statistical. For instance, deciphering the 
neural code requires knowledge of the probability of observing patterns of activity in response 
to stimuli [1]; determining which features of a protein are important for correct folding requires 
knowledge of the probability that a particular sequence of amino acids folds naturally [2, 3]; and 
determining the patterns of foraging of animals and their social and individual behavior requires 
knowledge of the distribution of food and species over both space and time [4-6] . 

Building statistical descriptions of biological systems is, however, hard. There are several 
reasons for this: i) biological systems are composed of large numbers of elements, and so can 
exhibit a huge number of configurations, in fact, an exponentially large number, ii) the elements 
typically interact with each other, making it impossible to view the system as a collection 
of independent entities, and iii) because of technological considerations, the descriptions of 
biological systems have to be built from very little data. For example, with current technology in 
neuroscience, we can record simultaneously from only about 100 neurons out of approximately 
100 billion in the human brain. So, not only are we faced with the problem of estimating 
probability distributions in high dimensional spaces, we must make the estimates based on very 
little information. 

Despite these apparent difficulties, recent work has suggested that the situation may be 
less bleak than it seems. There is evidence that accurate statistical description of systems can 
be achieved without having to examine all possible configurations [2,3,7-11]. One merely has 
to measure the probability distribution over pairs of elements and use those to build the full 
distribution. These "pairwise models" potentially offer a fundamental simplification, since the 
number of pairs is quadratic in the number of elements, not exponential. However, support 
for the efficacy of pairwise models has, necessarily, come from relatively small subsystems - 
small enough that the true probability distribution could be measured experimentally, allowing 
direct comparison of the pairwise distribution to the true one [7-9, 11]. While these studies 
have provided a key first step, a critical question remains: will the results from the analysis 
of these small subsystems extrapolate to large ones? That is, if a pairwise model predicts 
the probability distribution for a subset of the elements in a system, will it also predict the 
probability distribution for the whole system? Here we find that, for a biologically relevant 



3 



class of systems, this question can be answered quantitatively and, importantly, generically - 
independent of many of the details of the biological system under consideration. And the answer 
is, generally, "no." In this paper, we explain, both analytically and with simulations, why this 
is the case. 

2 Results 

2.1 The extrapolation problem 

To gain intuition into the extrapolation problem, let us consider a specific example: neuronal 
spike trains. Figure lA shows a typical spike train for a small population of neurons. Although 
the raw spike times provide a complete description, they are not a useful representation, as they 
are too high-dimensional. Therefore, we divide time into bins and re-represent the spike train as 
Os and Is: if there is no spike in a bin; 1 otherwise (Fig. IB) [7-9, 11]. For now we assume that 
the bins are independent (an assumption whose validity we discuss below, and in more detail in 
Sec. 3). The problem, then, is to find ptruei^) = Ptme 

(ri, r2, Tat) where rj is a binary variable 
indicating no spike (rj = 0) or one or more spikes (r^ = 1) on neuron i. Since this, too, is a high 
dimensional problem (though less so than the original spike time representation), suppose that 
we instead construct a pairwise approximation to ptrue, which we denote Ppair, for a population 
of size N. (The pairwise model derives its name from the fact that it has the same mean and 
pairwise correlations as the true model.) Our question, then, is: lippair is close to ptme for small 
N , what can we say about how close the two distributions are for large A^? 

To answer this question quantitatively, we need a measure of distance. The measure we use, 
denoted A^v, is defined in Eq. (8) below, but all we need to know about it for now is that if 
A^r = then Ppair = Ptrue, and if Atv is near one then Ppair is far from ptme- In terms of A^v, 
our main results are as follows: first, for small N, in what we call the perturbative regime, Ajy 
is proportional to A — 2. In other words, as the population size increases, the pairwise model 
becomes a worse and worse approximation to the true distribution. Second, this behavior is 
entirely generic: for small A, A^v increases linearly, no matter what the true distribution is. 
This is illustrated schematically in Fig. 2, which shows the generic behavior of A^v- The solid 
red part of the curve is the perturbative regime, where A^v is a linearly increasing function of 
A; the dashed curves show possible behavior beyond the perturbative regime. 
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Figure 1: Transforming spike trains to 
spike count. A. Spike rasters. Tick marks 
indicate spike times; different rows corre- 
spond to different neurons. The horizon- 
tal dashed hues are the bin boundaries. 
B. Spike count in each bin. In this exam- 
ple the bins are small enough that there is 
at most one spike per bin, but this is not 
necessary - one could use bigger bins and 
have larger spike counts. 



time 

These results have an important corollary: if one does an experiment and finds that Ajy is 
increasing linearly with N, then one has no information at all about the true distribution. The 
flip side of this is more encouraging: if one can measure the true distribution for sufficiently large 
N that Ajv saturates, as in the dashed blue line in Fig. 2, then one can have some confidence that 
extrapolation to large is meaningful. The implications for the interpretation of experiments 
is, then, that extrapolation to large N is valid only if one can analyze data past the perturbative 
regime. 

Under what conditions is a subsystem in the perturbative regime? The answer turns out 
to be simple: the size of the system, A^, times the average probability of observing a spike in 
a bin, must be small compared to 1. For example, if the average probability is 1/100, then a 
system will be in the perturbative regime if the number of neurons is small compared to 100. 
This last observation would seem to be good news for studies in which spikes are binned across 
time and temporal correlations are ignored. For such binned spike trains, the probability of 
a spike can be made arbitrarily small by simply shrinking the time bins, and so the size of 
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Figure 2: Cartoon illustrating the dependence 
of Ajv on A^. For small N there is always 
a perturbative regime in which Aat increases 
linearly with N (solid red line). When N be- 
comes large, on the other hand, A^v may con- 
tinue increasing with N (red and black dashed 
lines) or it may plateau (cyan dashed line), 
depending on ptme- The observation that A at 
increases linearly with does not, therefore, 
provide much, if any information about the 
large behavior. 

the population for which the pairwise model appears good can be made arbitrarily large. The 
problem with this, though, is that temporal correlations can be ignored only when time bins are 
large compared to the autocorrelation time. This leads to a kind of catch-22: pairwise models 
are guaranteed to work well (in the sense that they describe spike trains in which temporal 
correlations are ignored) if one uses small time bins, but small time bins is the one regime where 
ignoring temporal correlations is not a valid approximation. 

In the next several sections we quantify the qualitative picture presented above: we write 
down an explicit expression for A^v, explain why it increases linearly with N, when is small, 
and provide additional tests, besides assessing the linearity of A^r, to determine whether or not 
one is in the perturbative regime. 

2.2 A measure of goodness of fit 

A natural measure of the distance between pp^ir snidptme is the Kullback-Leibler (KL) divergence 
[12], denoted DKL{ptrue\\Ppair) and defined as 




Ptruejr) 
Ppair (l") 

The KL divergence is zero if the two distributions are equal; otherwise it is nonzero. 



DKL{Ptrue\\Ppair) = ^ Ptrue (r)log2 

Ppair (,r j 
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Although the KL divergence is a very natural measure, it is not easy to interpret (except, 
of course, when it is exactly zero). That is because a nonzero KL divergence tells us is that 
Ppair / Ptrue , but it docs not give us any real handle on how much we benefit by including the 
pairwise correlations in our approximation. To make sense of the KL divergence, then, we need 
something to compare it to. A reasonable reference quantity, used by a number of authors [7-9], 
is the KL divergence between the true distribution and the independent one, the latter denoted 
Pind- The independent distribution, as its name suggests, is a distribution in which the variables 
are taken to be independent. 



where Pi{ri) is the distribution of the response of the i*^ neuron, rj. With this choice for a 
comparison, we define our measure of goodness of fit as 



Note that the denominator in this expression, DKL{ptrue\\Pind), is usually referred to as the 
multi-information [7,13,14]. 

The quantity A^v, which we introduced in the previous section, lies between and 1, and 
measures how well a pairwise model does relative to an independent model. If it is 0, the 
pairwise model is equal to the true model (ppair(r) = Ptruei^))] if it is near 1, the pairwise model 
offers little improvement over the independent model; and if it is exactly 1, the pairwise model 
is equal to the independent model (ppair(r) = Pmd(r)), and so offers no improvement. (Our 
assertion that Ajy cannot exceed 1 assumes that the pairwise model cannot be worse than the 
independent one, something that is reasonable in practice but not guaranteed in general.) 

How do we attach physical meaning to the two divergences DKL{ptrue\\Ppair) and D k L{ptrue\\Pind)'^ 
For the latter, we use the fact that, as is easy to show, 

DKLiptrueWPind) = Sind — Strue, (4) 

where Sind and Stme are the entropies [15, 16] of pmd and ptme, respectively, defined, as usual, 
to be S[p] = — ^i.p(r) log2p(r). For the former, we use the definition of the KL divergence to 
write 

D K LiPtrue\\Ppair) — ~ ^^Ptruei^) loS2(Ppair-(r)) — Strue = Spair ~ Strue ■ i^) 



Pindin, . . . ,rAr) = JjK(ri) 



(2) 



DKL{Ptrue\\Ppair) 
DKL{PtTue\\Pind) 



(3) 
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Figure 3: Schematic plot of Sind (black line), 
Spair (cyan line) and Stme (red line) . The bet- 
ter the pairwise model, the closer Spair is to 
Strue- This is reflected in the cost function 



A TV , which is the distance between the red and 



cyan lines divided by the distance between the 



red and black lines. 
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The quantity Spair has the flavor of an entropy, although it is a true entropy only when Ppair 
is maximum entropy as well as pairwise (the maximum entropy pairwise model, or maximum 
entropy model for short; see Eq. (11)). For other pairwise distributions, all we need to know is 
that Spair lies between Stme and Sind, something that is guaranteed by our assumption that the 
pairwise model is no worse than the independent model. 

What Eqs. (4) and (5) tell us is that Ajv (Eq. (3)) is the ratio of the amount of "entropy" 
not explained by the pairwise model to the amount of entropy not explained by the independent 
model. A plot illustrating the relationship between A^v, the two entropies Sind and Stme, and 
the entropy-like quantity Spair, is shown in Fig. 3. 

2.3 Aiv in the perturbative regime 

The extrapolation problem discussed above is the problem of determining A/yr in the large 
limit. This is hard to do in general, but, as we show in Methods, Sec. 5.2, there is a perturbative 
regime in which it is possible. The small parameter that defines this regime is the average 
number of spikes per bin; this is written quantitatively as Nv6t where 6t is the bin size and V 
the average firing rate. Letting and Vi the firing rate of neuron i, the latter quantity is given by 



Note that our small parameter depends on N ^ which means that for u and 5t fixed, there is a 




(6) 
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maximum population size we can access perturbatively. We return to this point below, as it has 
major implications for experimental studies. 

The first step in the perturbation expansion is to compute the two quantities that make up 
An- DKLiPtrueWPind) and DKL{ptrue\\Ppair)- As we show in Methods, Sec. 5.2, these are given 
by 

DKLiPtrueWPind) = QindN [N - 1) {u5tf + O {{Nv5tf) (7a) 
DKL{ptrue\\Ppa^r) = 9pairN{N - 1){N - 2) (mf + O {{Nu6t)^) . (7b) 



The exact forms of the prefactors gind and gpair are given in Eqs. (41) and (43). The details, 
however, are not so important; the important things to know about them is that they are 
independent of N and uSt, and they depend on the low order statistics of the spike trains: 
gind depends on the second order normalized correlations function, and gpair depends on the 
second and third order normalized correlations function, as defined below in Eq. (13). The 
A^-dependence in the first term on the right hand side of Eq. (7a) has been noted previously [7], 
although the authors did not compute the prefactor, gind- 

Inserting Eq. (7) into Eq. (3) (into the definition of Ajv), we arrive at our main result, 

= ^(iV - 2) u5t + O {{Nu5tf) . (8) 

9ind 

This expression tells us how A jv scales with N in the perturbative regime - the regime in which 
Nv5t <^ 1. The key observation about this scaling is that it is independent of the details of the 
true distribution, ptme- This has a very important consequence, one that has major implications 
for experimental data: if one does an experiment and finds that that A^r is proportional to A^ — 2, 
then the system is in the perturbative regime, and one does not know whether Ppair will remain 
close to ptrue as increases. What this means in practical terms is that if one wants to know 
whether a particular pairwise model is a good one for large systems, it is necessary to consider 
values of N that are significantly greater than Nc, where 

We interpret Nc as the value at which there is a crossover in the behavior of the pairwise model. 
Specifically, if ^ Nc, the system is in the perturbative regime and the pairwise model is not 



informative about the large N behavior, whereas if ^ Nc, the system is in a regime in which 
it may be possible to make inferences about the behavior of the full system. 

2.4 The dangers of extrapolation 

Although the behavior of Atv in the perturbative regime does not tell us much about its behavior 
at large N, it is possible that other quantities that can be calculated in the perturbative regime, 
gind, 9pair, and Sind (the last one exactly), are informative, as others have suggested [7]. Here 
we show that they are also uninformative. 

The easiest way to relate the perturbative regime to the large N regime is to extrapolate 
Eqs. (7a) and (7b), and ask what their large N behavior tells us. Generic versions of these 
extrapolations, plotted on a log-log scale, are shown in Fig. 4A, along with a plot of the inde- 
pendent entropy, Sind (which is necessarily linear in A^; see Sec. 5.1). The first thing we notice 
about the extrapolations is that they do not, technically, have a large N behavior: one termi- 
nates at the point labeled Nind, which is where D k L{ptrue\\Pind) = S^d (and thus Stme = 0; 
continuing the extrapolation implies negative true entropy); the other at the point labeled Npair, 

which is where DKL{Ptrue\\Ppair) = Sind (and thus Stme < 0, since Spair < Sind)- 

Despite the fact that the extrapolations end abruptly, they still might provide information 
about the large N regime. For example, Npair and/or Nind might be values of N at which some- 
thing interesting happens. To see if this is the case, in Fig. 4B we plot the naive extrapolations 
of Spair and Stme, as given by Eq. (7), on a linear-linear plot, along with Sind (solid lines). This 
plot contains no new information compared to Fig. 4A, but it does elucidate the meaning of the 
extrapolations. Perhaps its most striking feature is that the naive extrapolation of Stme has a 
decreasing portion. As is easy to show mathematically, this cannot happen (intuitively, that 
is because observing one additional neuron cannot decrease the entropy of previously observed 
neurons). Thus, Nind, which occurs well beyond the point where the naive extrapolation of Stme 
is decreasing, has essentially no meaning, something that has been pointed out previously by 
Bethge et al. [10]. The other potentially important value of N is Npair- This, though, suffers 
from similar problems: either Npair > Nind, in which case the entropy is negative, or it crosses 
the green curve in Fig. 4A from below, meaning Ajv > 1. Either way, it also doesn't have much 
meaning. 

How do the naively extrapolated entropies - the solid lines in Fig. 4B - compare to the 
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Figure 4: Cartoon showing the extrapolation of entropy and KL divergences, and illustrating 
why the two natural quantities derived from it, A^i„,d and Npair, occur beyond the point at which 
the extrapolation is meaningful. A. Extrapolations on a log-log scale. Black: Sind versus A^; 
green: extrapolation of DxL(ptr«e||Pm(i); cyan: extrapolation of L>i^L(Pirne I bmaxent)- The red 
points are the data. The points Nind and Npair label the intersections of the two extrapolations 
with the independent entropy, S^d- B. Extrapolation of the entropies rather than the KL 
divergences, plotted on a linear-linear scale. The data, again shown in red, is barely visible in 
the lower left hand corner. Black: Sind versus N; solid maroon: extrapolation of Spair', solid 
orange: extrapolation of Strue- The dashed maroon and orange lines are the extrapolations of 
the true pairwise "entropy" and true entropy, respectively. 
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true entropies? To answer this, in Fig. 4B we show the true behavior of Stme and Spair versus 
N (dashed hues). Note that Stme is asymptoticaUy hnear in N , even though the neurons are 
correlated, a fact that forces Spair to be hnear in A^, as it is sandwiched between Stme and 
Sind- (The asymptotically linear behavior of Stme is typical, even in highly correlated systems. 
Although this is not always appreciated, it is easy to show; see Sec. 5.1.) Comparing the dashed 
and solid lines, we see that the naively extrapolated and true entropies, and thus the naively 
extrapolated and true values of A^r, have extremely different behavior. This further suggests 
that there is very little connection between the perturbative and large regimes. 

These observations can be summarized by noting that gind and gpair depend only on correla- 
tion coefficients up to third order (see Eqs. (41) and (43)), whereas the large behavior depends 
on correlations at all orders. Thus, since gind and Qpair tell us very little, if anything, about 
higher order correlations, it is not surprising that they tell us very little about the behavior of 
Atv in the large N limit. 

2.5 Numerical simulations 

To check that our perturbation expansions, Eqs. (7) and (8), are correct, and to investigate 
the regime in which they are valid, we performed numerical simulations. We generated, from 
synthetic data, a set of true distributions, computed D k liPtmeWPind) ■, D k LivtmeWPpair) ■, and 
Atv numerically for each of them, and compared to the values predicted by Eqs. (7) and (8). 
The results are shown in Fig. 5. Before discussing that figure, though, we explain our procedure 
for constructing true distributions. 

The set of true distributions we used were generated from a third order model (so named 
because it includes up to third order interactions). This model has the form 



Ptme{ri, . . .,rN*) = ^^exp 



E ^^"^^ + E J'r^-^^j + E K'^k'^^^j^k (10) 

i i<j i<j<k 

where Ztme is a normalization constant, chosen to ensure the the probability distribution sums 



to 1, and the sums over i, j and k run from 1 to N*. The parameters hf"^^, J^J"*^ and KjjJ^^ were 
chosen by sampling from distributions (see Methods, Sec. 5.3), which allowed us to generate 
many different true distributions. 

For a particular simulation (corresponding to a column in Fig. 5), we generated a true 
distribution with N* = 15, randomly chose 5 neurons, and marginalized over them. This gave us 
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a 10-neuron true distribution. True distributions with < 10 were constructed by marginalizing 
over additional neurons within our 10-neuron population. To achieve a representative sample, we 
considered all possible marginalizations (of which there are 10 choose N, or 10!/[A^!(10 — -/V)!])- 
The results in Fig. 5 are averages over these marginalizations. 

For neural data, the most commonly used pairwise model is the maximum entropy model. 
Therefore, we use that one here. To emphasize the maximum entropy nature of this model, we 
replace the label ^^paii^^ that we have been using so far with ^^maxent." The maximum entropy 
distribution has the form 



Pmaxentiy) ^ exp 



hin + Ji. 

i i<j 



ill) 



Fitting this distribution requires that we choose the hi and Jij so that the first and second 
moments match those of the true distribution. Quantitatively, these conditions are 



{l~i)rnaxent — (j'i)true (12a) 
{t iT j) Ynaxent — {^i'fj)true (1^1*) 



where the angle brackets, (. . . )maxent and (. . . )true, represent averages with respect to Pmaxent 
and ptruei respectively. Once we have hi and Jij that satisfy Eq. (12), we calculate the KL 
divergences, Eqs. (7), and use those to compute A^v- 

The results are shown in Fig. 5. The rows correspond to our three quantities of interest: 
DKL{ptrue\\Pind), Dr LiPtrue\\Ppair) , and Ajv (top to bottom). The columus correspond to dif- 
ferent values of 1761, with the smallest 175t on the left and the largest on the right. Red circles 
are the actual values of these quantities; blue ones are the predictions from Eqs. (7) and (8). 

As suggested by our perturbation analysis, the smaller the value of 176t, the better the 
agreement between computed and predicted values. Our simulations corroborate this: the left 
column of Fig. 5 has udt = 0.024, and agreement is almost perfect out to = 10; the middle 
column has Vdt = 0.029, and agreement is almost perfect out to = 7; and the right column 
has 175t = 0.037, and agreement is not good for any value of N. 
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Figure 5: The N dependence of the KL divergences and the goodness of fit, Ajy. Data was 
generated from a third order model, as explained in Sec. 5.3 (Methods), and fit to maximum 
entropy pairwise model and independent models. All data points correspond to averages over 
marginalizations of the true distribution (see text for details). The red points were computed 
directly from the model fits, Eqs. (1), (3) and (4); the blue points are predictions of the pertur- 
bative expansions, Eqs. (7) and (8). The three columns correspond to udt = 0.024, 0.029, and 
0.037, from left to right. A, B, C {l75t = 0.024). Predictions from the perturbative expansion 
are in good agreement with the measurements up to = 10, indicating that the data is in the 
perturbative regime. D, E, F (udt = 0.029). Predictions from the perturbative expansion are in 
good agreement with the measurements up to A = 7, indicating that the data is only partially 
in the perturbative regime. G, H, I (v6t = 0.037). Predictions from the perturbative expansion 
are not in good agreement with the measurements, even for small A, indicating that the data 
is outside the perturbative regime. 
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These results validate the perturbation expansions in Eqs. (7) and (8), and show that those 
expansions provide sensible predictions - at least for some parameters. They also suggest a 
natural way to assess the significance of one's data: plot DKL{ptrue\\Pind), DKL{ptrue\\Ppair), 
and A^r versus A^, and look for agreement with the predictions of the perturbation expansion. 
If agreement is good, as in the left column of Fig. 5, then one is in the perturbative regime, and 
one knows very little about the true distribution. If, on the other hand, agreement is bad, as 
in the right column, then one is out of the perturbative regime, and there is hope of extracting 
meaningful information about the relationship between the true and pairwise models. 

That said, the qualifier "at least for some parameters" is an important one. This is because 
the perturbation expansion is essentially an expansion that depends on the normalized corre- 
lation coefficients, and there is an underlying assumption that they don't exhibit pathological 
behavior. The k^^ normalized correlation coefficient for the distribution p, denoted • • , is 
written 



V _ {{{^ii (^n)p)((^t2 (^»2)p) • • • ((^ifc {'''ik)p))p 

Pili2-ik — Ij.. \ U. ) (r- ) ' ^ ^ 

\'ll/p\'«2/P • • • \'«fc/P 

A potentially problematic feature of the correlation coefficients is that the denominator is a 
product over mean activities. If the mean activities are small, the denominator can become 
very small, leading to very large correlation coefficients. Although our perturbation expansion 
is always valid for sufficiently small time bins (because the correlation coefficients eventually 
becomes independent of bin size; see Methods, Sec. 5.4), "sufficiently small" can depend in 
detail on the parameters. For instance, at the maximum population size tested (A^ = 10) and 
for the true distributions that had 176t < 0.03, the absolute error of the prediction had a median 
of approximately 16%. However, about 11% of the runs had errors larger than 60%. Thus, the 
exact size of the small parameter at which the perturbative expansion breaks down can depend 
on the details of the true distribution. 

2.6 Local fields and pairwise couplings have a simple dependence on firing 
rates and correlation coefiicients in the perturbative regime 

Estimation of the KL divergences and Ajv from real data can be hard, in the sense that it takes 
a large amount of data for them to converge to their true values. We therefore provide a second 
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set of relationships that can be used to determine whether or not a particular data set is in the 
perturbative regime. These relationships are between the parameters of the maximum entropy 
model, the hi and Jij, and the mean activity and normalized second order correlation coefficient 
(the latter defined in Eq. (13)). 

Since the quantity Vdt plays a central role in our analysis, we replace it with a single param- 
eter, which we denote 6, 

5 = V5t. (14) 

In terms of this parameter, we find, (using the same perturbative approach that led us to Eqs. 
(7) and (8); see Sec. 5.2.2), that 

hi = \og[{ny^ -1\+0{N5) (15a) 
Jij = log [1 + pij] + 0{N5) (15b) 



where pij, the normalized second order correlation coefficient, is defined in Eq. (13) with k = 2; 
it is given explicitly by 

{rjrj) - {ri)irj) 

(We don't need a superscript on p or a subscript on the angle brackets because the first and 
second moments are the same under the true and pairwise distributions.) 

Equation (15) tells us that the A^-dependence of the hi and Jij, the local fields and pairwise 
couplings, are very weak. In Fig. 6 we confirm these theoretical predictions through numerical 
simulations. 

As an aside, we should point out that the A^-dependence is a function of the variables 
used to represent the firing patterns. Here we use for silence and 1 for firing, but another, 
possibly more common, representation, derived from the Ising model and used in a number 
of studies [7,9,11], is to use —1 for silence and +1 for firing. This amounts to making the 
change of variables Si = 2ri — 1. In terms of Sj, the maximum entropy model has the form 
P[r) ~ exp ^ . /i. + }2i<j Jij SiSj 



where ^ and J- are given by 



ji^ng ^ Jjl _ ^^^^^ 



16 



i3 -4 




-6 -4 -2 

/7. .predicted 



-4.5 




-6 -4 -2 

/j, .predicted 



-4- 




-4.5J ***,, 



5 10 



•I 



-6 -4 -2 

/7. , predicted 



-2 



B 



0.5-1 

o■^ — I- y 



♦ 



2 

^ .predicted 




-4-2 2 

predicted 



-6 



0.5 



»•**«* 




5 ^10 



-6 



-2 2 

predicted 



Figure 6: Comparison of the true local fields (/ij, top row) and pairwise interactions (J/j, bottom 
row), to the predictions from the perturbation expansion, Eq. (15). Values of N ranging from 
5 to 10 are shown, with different colors corresponding to different A^s. For each value of N ^ 
fits are shown for 45 realization of the true distribution. Insets show the A^-dependence of the 
mean local fields (top) and mean pairwise interactions (bottom). The three columns correspond 
exactly to the columns in Fig. 5. A, B (^bt = 0.024). There is a very good match between 
the theoretical and fit values of both local fields and pairwise interactions. C, D (p6t = 0.029). 
Even though Vdt has increased, the match is still good. E, F (p6t = 0.037). The predicted 
and fit local fields and pairwise interactions do not match as well as the cases shown in A,B, 
C and D. There is also now a strong A^-dependence in the mean local fields, and a somewhat 
weaker dependence in the pairwise interactions. This indicates that the perturbative expansion 
is breaking down. 
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The second term on the right hand side of Eq. (17a) is proportional to — 1, which means the 
local fields in the Ising representation acquire a linear A^-dependence that was not present in 
our 0/1 representation. The two studies that reported the A^-dependence of the local fields [7, 9] 
used this representation, and, as predicted, their local fields had a component that was linear 
in N. 

Equation (15b) does more than just predict a lack of A^-dependence; it also provides a 
functional relationship between the pairwise couplings and the normalized pairwise correlations 
function, pij. In Figs. 7A-C we plot the pairwise couplings, Jij, versus the normalized pairwise 
correlation coefficient, pij (blue dots), along with the prediction from Eq. (15b) (black line). 
Consistent with with our predictions, the data in Figs. 7A-C essentially follows a line - the one 
predicted by Eq. (15b). 

A relationship between the pairwise couplings and the correlations coefficients has been 
sought previously, but for the more standard Pearson correlation coefficient [7, 9, 11]. Our anal- 
ysis explains why it was not found. The Pearson correlation coefficient, denoted Cij, is given 
by 



{riVj) - {ri){rj) ^^^^ 



In the small (rj) limit - the limit of interest - the right hand side of Eq. (18) is approximately 
equal to [{ri){rj)]^^'^ pij. Because [{f'i){rj)]^^'^ depends on the local fields, hi and hj (see Eq. (15a)) 
and there is a one-to-one relationship between pij and Jij (Eq. (15b)), there can't be a one-to- 
one relationship between Cij and Jij. We verify the lack of a relationship in Figs. 7D-E, where 
we again plot Jij, but this time versus the standard correlation coefficient, Cij. As predicted, 
the data in Figs. 7D-E is scattered over two dimensions. This suggests that pij, not Qj, is the 
natural measure of the correlation between two neurons when the have a binary representation, 
something that has also been suggested by Amari based on information-geometric arguments 
[17]. 

Note that the lack of a simple relationship between the pairwise couplings and the standard 
correlation coefficient has been a major motivation in building maximum entropy models [7, 11]. 
This is for good reason: if there is a simple relationship, knowing the Jjj's adds essentially 
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Figure 7: Pairwise couplings versus pairwise correlations, showing that there is a simple relation 
between Jij and pij but not between Jij and Cij. Top row: Jij versus the normalized coefficients, 
Pij (blue points), along with predicted relationship, via Eq. (15b) (black line). Bottom row: 
Jij versus the Pearson correlation coefficients, Cij, Eq. (18) (blue points). The three columns 
correspond exactly to the columns in Fig. 5, for which Vdt = 0.024, 0.029, and 0.037, from left 
to right. The prediction in the top row (black line) matches the data well, even in the rightmost 
column. 
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nothing. Thus, plotting Jij versus pij (but not Cij) is an important test of one's data, and if the 
two quantities faU on the curve predicted by Eq. (15b), the maximum entropy model is adding 
very little information, if any. 

3 Is there anything wrong with using small time bins? 

An outcome of our perturbative approach is that the goodness of fit measure, Aat, decreases 
linearly with bin size (see Eq. (3)). This suggests that one could make the pairwise model look 
better and better simply by making the bin size smaller and smaller. Is there anything wrong 
with this? The answer is yes, for reasons we discussed in Sec. 2.1; here we emphasize and expand 
on this issue, as it is an important one for making sense of experimental results. 

The problem arises because what we have been calling the "true" distribution is not really 
the true distribution of spike trains. It is the distribution assuming independent time bins, an 
assumption that becomes worse and worse as we make the bins smaller and smaller. (We use 
this potentially confusing nomenclature primarily because all studies of neuronal data carried 
out so far have assumed temporal independence, and compared the pairwise distribution to the 
temporally independent - but still correlated across neurons - distribution [7-9, 11]. In addition, 
the correct name "true under the assumption of temporal independence," is unwieldy.) Here 
we quantify how much worse. In particular, we show that if one uses time bins that are small 
compared to the characteristic correlation time in the spike trains, the pairwise model will not 
provide a good description of the data. Essentially, we show that, when the time bins are too 
small, the error one makes in ignoring temporal correlations is larger than the error one makes 
in ignoring correlations across neurons. 

As usual, we divide time into bins of size 6t. However, because we are dropping the indepen- 
dence assumption, we use r*, rather than r, to denote the response in bin t. The full probability 
distribution over all time bins is denoted V{v^ ^ . . . , r^^). Here M is the number of bins; it is 
equal to T/5t for spike trains of length T. 

If time bins are approximately independent, we can write 




(19) 
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and if the pairwise model is a good one, we have 



Ptrueij^ ) ~ Ppairij^ )• 



(20) 



Combining Eqs. (19) and Eq. (20) then gives us an especially simple expression for the full 
probability distribution: V{r^, . . . , r^^) ~ Ppa"'(r*)- 

The problem with small time bins lies in Eq. (19): the right hand side is a good approximation 
to the true distribution when the time bins are large compared to the spike train correlation 
time, but it is a bad approximation when the time bins are small (because adjacent time bins 
become highly correlated). To quantify how bad, we compare the error one makes assuming 
independence across time to the error one makes assuming independence across neurons. The 
ratio of those two errors, denoted 7, is given by 



It is relatively easy to compute 7 in the limit of small time bins (see Sec. 5.5), and we find that 



As expected, this reduces to our old result, Ajv, when there is only one time bin (M = 1). 
When M is larger than 1, however (which is, of course, the case of interest), the second term is 
always at least one, and for small bin size, the third term is much larger than one. Consequently, 
if we use bins that are small compared to the temporal correlation time of the spike train, the 
pairwise model will do a very bad job describing the full, temporally correlated spike trains. 

4 Discussion 

Probability distributions over the configurations of biological systems are extremely important 
quantities. However, because of the large number of interacting elements comprising such sys- 
tems, these distributions can almost never be determined directly from experimental data. Using 
parametric models to approximate the true distribution is the only existing alternative. While 
such models are promising, they are typically applied only to small subsystems, not the full 
system. This raises the question: are they good models of the full system?. 

We answered this question for a class of parametric models known as pairwise models. We 
focused on a particular application, neuronal spike trains, and our main result is as follows: if 



7 = 



Dkl {V{r\...,r''')\\l\tPpa^r{r')) 
MDKL{p{r)\\pind{r)) 
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one were to record spikes from multiple neurons, use sufficiently small time bins and a sufficiently 
small number of cells, and assume temporal independence, then a pairwise model would almost 
always succeed in matching the true (but temporally independent) distribution - whether or not 
it would match the true (but still temporally independent) distribution for large time bins or a 
large number of cells. In other words, pairwise models in the "sufficiently small" regime, what 
we refer to as the perturbative regime, have almost no predictive value for what will happen 
with large populations. This makes extrapolation from small to large systems dangerous. 

This observation is important because pairwise models, and in particular maximum entropy 
pairwise models, have recently attracted a great deal of attention: they have been applied to 
salamander and guinea pig retinas [7], primate retina [8], primate cortex [9], cultured cortical 
networks [7], and cat visual cortex [11]. These studies have mainly operated close to the per- 
turbative regime. For example, Schneidman et al. [7] had Nu6t ^ 0.35, Tang et al. [9] had 
NvSt ~ 0.06 to 0.4 (depending on the preparation), and Yu et al. had Nv5t ~ 0.2. For these 
studies, then, it would be hard to justify extrapolating to large populations. 

The study by Shlens et al. [8], on the other hand, might be more amenable to extrapolation. 
This is because spatially localized stimuli were used to stimulate retinal ganglion cells, for which a 
nearest neighbor maximum entropy models provided a good fit to their data. (Nearest neighbor 
means Jij is zero unless cell i and cell j are adjacent.) As is not hard to show, for nearest 
neighbor models the small parameter in the perturbative expansion is KVSt where K is the 
number of nearest neighbors. Since K is fixed, independent of the population size, the small 
parameter will not change as the population size increases. Thus, Shlens et al. may have tapped 
into the large population behavior even though they considered only a few cells at a time in 
their analysis. 

4.1 Time bins and population size 

That the pairwise model is always good if Nv5t is sufficiently small has strong implications: if 
we want to build a good model for a particular N, we can simply choose a bin size that is small 
compared to l/NV. However, one of the assumptions in all pairwise models used on neural data 
is that bins at different times are independent. This produces a tension between small time 
bins and temporal independence: small time bins essentially ensure that a pairwise model will 
provide a close approximation to a model with independent bins, but they make adjacent bins 
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highly correlated. Large time bins come with no such assurance, but they make adjacent bins 
independent. Unfortunately, this tension is often unresolvable in large populations, in the sense 
that pairwise models are assured to work only up to populations of size l/ij^Tcorr) where Tcorr 
is the typical correlation time. Given that v is at least several Hz, for experimental paradigms 
in which the correlation time is more than a few hundred ms, l/ii'Tcorr) is about one, and this 
assurance does not apply to even moderately sized populations of neurons. 

These observations are especially relevant for studies that use small time bins to model spike 
trains driven by natural stimuli. This is because the long correlation times inherent in natural 
stimuli are passed on to the spike trains, so the assumption of independence across time (which 
is required for the independence assumption to be valid) breaks badly. Knowing that these 
models are successful in describing spike trains under the independence assumption, then, does 
not tell us whether they will be successful in describing full, temporally correlated, spike trains. 
Note that for studies that use stimuli with short correlation times (e.g., non-natural stimuli such 
as white noise), the temporal correlations in the spike trains are likely to be short, and using 
small time bins may be perfectly valid. 

The only study that has investigated the issue of temporal correlations in maximum entropy 
models does indeed support the above picture [9]: for the parameters used in that study {Nu6t = 
0.06 to 0.4), the maximum entropy pairwise model provided a good fit to the data (Aat was 
typically smaller than 0.1), but it did not do a good job modeling the temporal structure of the 
spike trains. 

4.2 Other biological problems that have been approached with pairwise 
models, e.g, protein folding 

As mentioned in the Introduction, in addition to the studies on neuronal data, studies on protein 
folding have also emphasized the role of pairwise interactions [2,3]. Briefiy, proteins consist of 
strings of amino acids, and a major question in structural biology is: what is the probability 
distribution of amino acid strings in naturally folding proteins? One way to answer this is to 
approximate the full probability distribution of naturally folding proteins from knowledge of 
single-site and pairwise distributions. One can show that there is a perturbative regime for 
proteins as well. This can be readily seen using the celebrated HP protein model [18]), where 
a protein is composed of only two types of amino acids: if, at each site, one amino acid type is 
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preferred and occurs with high probabihty, say 1 — 5 (5 <C 1), then a protein of length shorter 
than 1/5 will be in the perturbative regime, and, therefore, a good match between the true 
distribution and the pairwise distribution for such a protein is virtually guaranteed. 

Fortunately, though, the properties of real proteins generally prevent this from happening: 
at the majority of sites in a protein, the distribution of amino acids is not sharply peaked around 
one amino acid. Even for those sites that are sharply peaked (the evolutionarily-conserved sites), 
the probability of the most likely amino acid rarely exceeds 90% [19,20]. This puts proteins 
consisting of only a few amino acids out of the perturbative regime, and puts longer proteins - 
the ones usually studied using pairwise models - well out of it. 

This difference is fundamental: because many of the studies that have been carried out on 
neural data were in the perturbative regime, the conclusions of those studies - specifically, the 
conclusion that pairwise models provide accurate descriptions of large populations of neurons - 
is not yet supported. This is not the case for the protein studies, because they are not in the 
perturbative regime. Thus, the evidence that pairwise models provide accurate descriptions of 
protein folding remain strong and exceedingly promising. 

4.3 Outlook 

We have developed a framework for assessing the validity of pairwise models applied to small 
systems. Essentially, we developed a set of tests to determine whether one's data is in the 
perturbative regime, a regime in which extrapolation to large populations is not warranted. 
This should serve as a useful guide, not just for analyzing experiments, but also for designing 
them. 

Although our framework is general, we focused primarily on its application to neural data. 
One of our main results is that the bin size carries an important tradeoff: if the bin size is 
small, then pairwise models work well, but at the price of ignoring temporal correlations; if the 
bin size is large enough so that adjacent bins are weakly correlated, then there is no guarantee 
that pairwise models will work at all. Pairwise models with small time bins, therefore, might be 
rescued by a small modification: take into account correlations across time as well as neurons. 
This would increase the complexity of the models, but the amount of data one needs to fit them 
would still not be so large, as only pairwise correlations and single neuron firing rates need to 
be estimated. Whether this modification would produce good models is not clear, but if it did 
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it would bring us much closer to a fundamental understanding of neural systems. 



5 Methods 



5.1 The behavior of the true entropy in the large N Hmit 

To understand how the true entropy behaves in the large N limit, we need only express the 
difference of the entropies as a mutual information. Using Spf to denote the true entropy of N 
neurons and 1(1; N) to denote the mutual information between one neuron and the other 
neurons in a population of size + 1, we have 



If knowing the activity of N neurons does not constrain the firing of neuron + 1, then the 
single neuron entropy, Si, will exceed the mutual information, 1(1; N), and the entropy will be 
an increasing function of N. For the entropy to be linear in N, all we need is that the mutual 
information saturates with A^. Because of synaptic failures, this is a reasonable assumption for 
networks of neurons: even if we observed all the neurons, there is still residual noise associated 
with uncertainty about which vesicles release neurotransmitter. Thus, using /(l;cx3) to denote 
the asymptotic value of the mutual information and {Si) to denote the average single-neuron 
entropy, we have 



where the corrections are sublinear in A^. 
5.2 Perturbative Expansion 

Our main quantitative result, given in Eq. (7), is that the KL divergence between the true distri- 
bution and both the independent and pairwise distributions can be computed perturbatively as 
an expansion in powers of N6. Here we carry out this expansion, and derive explicit expressions 
for the quantities gind and gpair- 

To simplify our notation, here we use p{r) for the true distribution. The critical step in 
computing the KL divergences perturbatively is to use the Sarmanov-Lancaster expansion [21- 



{Sn + Si) - Sn+1 = 1(1; N) Sn+1 -Sn = Si- /(I; A^) . 



(23) 



Sn = N[{Si) — /(I; oo)] + corrections 



(24) 
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25] for p{r), 

p(r) =p,„d(r)(l+ep(r)) (25) 
where 

P^nci{r) = n rf'^^^^S ^1 (26a) 
^p(r) ^ E ^^'^^^^J- + E ^^.k^^^^^^^^k + ... (26b) 

i<j i<j<k 

5ri = ri-fi (26c) 
f, = (l + exp(-W,))-^ (26d) 

This expansion has a number of important, but not immediately obvious, properties. First, 
as can be shown by a direct calculation, 

{ri)p = {ri)ind = n (27) 

where the subscripts p and ind indicate an average with respect to p{r) and Pind(r), respectively. 
This has an immediate corollary, 

{Sn)ind = 0. (28) 

This last relationship is important, because it tells us that if a product of Jr's contains any terms 
linear in one of the 6ri, the whole product averages to zero under the independent distribution. 
This can be used to show that 

{U^)hnd = (29) 
from which it follows that 

Ep(r) = ((l+ep(r))ind = l, (30) 
r 

which tells us that p(r) is properly normalized. Finally, a slightly more involved calculations 
provides us with a relationship between the parameters of the model and the moments: for 
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{6riSrj)p = ri{l - ri)rj{l - rj)jf- 
{5ri6rj6rk)p = fj(l - fi)fj(l - fj)fk{l - rk)}C^jf, ■ 



(31a) 
(31b) 



Virtually identical expressions hold for higher order moments. It is this last set of relationships 
that make the Sarmanov-Lancaster expansion so useful. 

Note that Eqs. (31a) and (31b), along with the expression for the normalized correlation 
coefficients given in Eq. (13), imply that 



(l-r-,)(l-f,)J,^ = /^^. (32a) 
(l-r-)(l-f,)(l-r-,)/Cf^., = pf_,.,. (32b) 

These identities will be extremely useful for simplifying expressions later on. 

Because the moments are so closely related to the parameters of the distribution, moment 
matching is especially convenient: to construct a distribution whose moments match those of 
p{r) up to some order, one simply needs to ensure that the parameters of that distribution, Tii, 
J'ij, ICijk, etc., are identical to those of the true distributions up to the order of interest. In 
particular, let us write down a new distribution, q{r), 



g(r) =p,„d(r)(l+e,(r)) (33a) 
i<3 i<j<k 

We can recover the independent distribution by letting ^g(r) = 0, and we can recover the pairwise 
distribution by letting = Jf-. This allows us to compute Dkl{p\\q) in the general case, and 
then either set to zero or set J^^ to Jfy 

Note that expressions analogous to those in Eq. (29-32) exist for averages with respect to 
q{r); the only difference is that p is replaced by q. 
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5.2.1 The KL divergence in the Sarmanov-Lancaster representation 

Using Eqs. (25) and (33a) and a small amount of algebra, the KL divergence between p{r) and 
g'(r) may be written 

where 

fix, y) = {l + x) [ln(l + x) - ln(l + y)] - (x - y) . (35) 

To derive Eq. (34), we used the fact that {S,p)ind = {S,q)ind = (see Eq. (29)). The extra term 
{x—y) was included to ensure that /(x, y) and its first derivatives vanish at a; = y = 0, something 
that greatly simplifies our analysis later on. 

Our approach is to Taylor expand the right hand side of Eq. (34) around £,p = £,q = 0, 
compute each term, and then sum the whole series. Using Unm to denote the coefficients of the 
Taylor series, we have 

DkUpM = :^Y.^-^n{U^rUrr).nd ■ (36) 

mn 

Note that because f{x,y) and its first derivatives vanish at x = y = 0, all terms in this sum 
have m -\- n > 2. 

Because both S^p and are themselves sums, an exact calculation of the terms in Eq. (36) 
would be difficult. However, as we show in Sec. 5.2.3 (in particular Eqs. (51) and (52)), they 
can be computed to lowest order in N5, and the result is 

i<j 
i<j<k 

where p^^-^ and p^^^, are given by 



~x — X j_ „x j_ X , X _ {fifjfk)x nrjrk , . 

Pijk — Pijk + Pij + Pik + Pjk — ) W^J 

X = p,q. The last equality in Eq. (38) follows from a small amount of algebra and the definition 
of the correlation coefficients given in Eq. (13). Equation (37) is valid only when m + n > 2, 
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which is the case of interest to us (since the Taylor expansion of /(x, y) has only terms with 
m + n > 2). 

The important point about Eq. (37) is that the m and n dependence follows that of the 
original Taylor expansion. Thus, when we insert this equation back into Eq. (36), we recover 
our original function - all we have to do is interchange the sums. For example, consider inserting 
the first term in Eq. (37) into Eq. (36), 

J2 E ^Mp'^Jriphr = E E ^^nip^^riPi.r = E ^^^^f(p'^ppW ■ 

m,n i<j i<j m,n i<j 

Performing the same set of manipulations on all of Eq. (37) leads to 



i<j 

+ i^T^ E ^^^^nf{p^jk^py + 0{NS)^) . (39) 

i<j<k 

This expression is true in general (except for some technical considerations; see Sec. 5.2.3); 
to restrict it to the KL divergences of interest we set p{r) to ptTue{^) and qir) to either Pindi^) 
or Ppair{^)- In the first case, q{v) = Pind{^)-, £,q{^) = 0, which in term implies that J^ - = 0, and 
thus p1- = 0. Using Eq. (39), we have (to lowest nonvanishing order in N5)^ 



DKLiPtrueWVind) = E ^i^jfiplp^) + ^ {i^^f) ■ (40) 

Then, defining 

^^^'^^ iV(F^T)h;^|jTi-^^^^'°^' ^^^^ 

and recalling that 5 = Vdt, we see that Eq. (40) is equivalent to Eq. (7a). 

In the second case, q{r) = Ppairi}^)^ the first and second moments of Ppair(r) and ptrueij^) 
are equal. This in turn implies, using Eq. (31), that J^- = Jf^ and thus p\- = pl^. Because 
f{x,x) = (see Eq. (35)), we see that the first three terms on the right hand side of Eq. (39) - 
those involving i and j but not k - vanish. The next order term does not vanish, and yields 
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DKL{Ptrue\\Ppa^r) = ]^ Yl ^i^J^kf {p'i.k^ pjjk) + ^ ((^'J)') ' (^2) 

Defining 

N(N -1)(N -2)ln(2) ^ jfjJ'^P'ijk^Pijk)^ (^3) 

i<j<k 

we see that Eq. (42) reduces to Eq. (7b). 



In 2 

i<j<k 



5.2.2 Local fields, pairwise couplings and moments 

In this section we derive, to leading order in N6, expressions relating the local fields and pairwise 
couplings of the maximum entropy model, hi and Jij, to the first and second moments. These 
are the expressions reported in Eq. (15). 

To do this, we simply compute the first and second moment under the assumption that 
N5 is small. This calculation proceeds along the same lines as in the previous section, with 
one extra consideration: the quadratic term in the maximum entropy distribution, Eq. (11), is 
proportional to r^rj, not Srirj. However, to lowest order in N5, this doesn't matter. That's 
because 

JijTirj = JijdviSrj + r.j Jij^j + constants . 

i<j i<j jj^i 

where fj is defined as in Eq. (26d) except with TYf replaced by hi, and we used the fact that 
Jij = Jji. The second term introduces a correction to the local fields, hi. However, the correction 
is 0{N6), so we drop it. We should keep in mind, though, that our final expression for hi will 
have corrections of 0{N5). 

Using Eq. (11), but with replaced by 6ri where it appears with Jij, we may write (after a 
small amount of algebra) 

( \ - ( ^ l + ex-(r)+V'(ex(r)) .... 

Vma.enAv) - p,nd[r) ^ ^ ^^^^^^ ^ V(Cx(r))).„. ^^^^ 

where Pind{^) is the same as the function Pindi^) defined in Eq. (26a), except that li]^ - is replaced 
by h, the subscript "incf' indicates, as usual, an average with respect to Pindi^^), and the two 
functions '^x(r) and -0(2;) are given by 

^xir) = ^JijSri6rj (45) 

i<j 
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and 

ipix) = e"" - 1 - X . (46) 

Given this setup, we can use Eqs. (54) and (55) to compute the moments under the maximum 
entropy model. That's because both ■ip{x) and its first derivative vanish at x = 0, which are the 
two conditions required for these equations to be valid. Using also the fact that {dri)ind = 0, 
Eqs. (54) and (55) imply that 

iUr) + HU^mnd = ^ firj^{J^J) + O {{N6f) (47a) 

{n)maxent = (1 + exp(-/li))"' + O {N 6^) (47b) 
{5ri5rj)maxent = Ufj [V'( J^j ) + Jij] + O {N 5^) . (47c) 



The term ^^+Jij^ in Eq. (47c) came from {Sri6rjS^xi'''^))ind^ see numerator in Eq. (44). Note that 
for the second two equations, we used the fact that, to lowest order in N5, the denominator in 
Eq. (44) is equal to 1. 

Finally, using Eq. (16) for the normalized correlation coefficient, dropping the subscript 
^^maxenf (since the first and second moments are the same under the maxent and true distri- 
butions), and inverting Eqs. (47b) and (47c) to express the local fields and coupling coefficients 
in terms of the first and second moments, we arrive at Eq (15). 

5.2.3 Averages of powers of and 

Here we compute {£,l^S,q)ind, which, as can be seen in Eq. (36), is the the key quantity in our 
perturbation expansion. Our starting point is to (formally) expand the sums that make up 
and S^g (see Eqs. (26b) and (33b)), which yields 

oo 

1=2 {mi,...,m;} ii< - <i; 

The sum over {mi, . . . is a sum over all possible configurations of the rrij. The coefficient 
ipmi,...,mi are complicated functions of the jf^ J^pKPijk^lCl-j^, etc. Computing these functions 
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is straightforward, although somewhat tedious, especiahy when / is large; below we compute 
them only for / = 2 and 3. The reason / starts at 2 is that m + n > 2; see Eq. (36). 

What we will show is that all terms with superscript (/) are 0(5'). To do this, we first note 
that, because the right hand side of Eq. (48) is an average with respect to the independent 
distribution, the average of the product is the product of the averages, 



(49) 



Then, using the fact that 5ri = (1 — r^) with probability and 5ri = (1 — r^) with probability 
(1 — fj) (see Eq. (26c)), we have 

m— 1 



nil 



+ {I - ri){-nr = n{l 



1 



-r,: 



1 - r,; 



(50) 



The significance of this expression is that, for m > 1, {5r^)ind ~ C^(^«) ~ ^'((5), independent of 
m. Consequently, if all the rrij in Eq. (49) are greater than 1, then the right hand side is 0(5'). 
This shows that, as promised above, the superscript (/) labels the order of the terms. 

As we saw in Sec. 5.2.1, we need to go to third order in 6, which means we need to compute 
the terms on the right hand side of Eq. (48) with / = 2 and 3. Let us start with 1 = 2, which 
picks out only on those terms with two unique indices. Examining the expressions for and 
given in Eqs. (26b) and (33b), we see that we must keep only terms involving J'ij, since JCijk has 
three indices, and higher order terms have more. Thus, the / = 2 contribution to the average in 

(2) 

Eq. (48), which we denote {^p{r)Cg{r))]nd^ is given by 



i<j 

Pulling Jfj and J'^j out of the averages, using Eq. (32a) to eliminate Jfj and in favor of p^j 
and pjj, and applying Eq. (50) (while throwing away some of the terms in that equation that 
are higher than second order in 6), the above expression may be written 

i<j 
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Note that we were not quite consistent in our ordering with respect to 5, in the sense that we 
kept some higher order terms and not others. We did this so that we could use pij rather than 
J'ij, as the former are directly observable. 

For I = 3 the calculation is more involved, but not substantially so. Including terms with 
exactly three unique indices in the sum on the right hand side of Eq. (48) gives us 

i<j<k 

This expression is not quite correct, since some of its terms contain only two unique indices 
- these are the terms proportional to {J^ljY"'{J'[j)"' - whereas it should contain only terms with 
exactly three unique indices. Fortunately, this turns out not to matter, for reasons we discuss 
at the end of the section. 

To perform the averages in Eq. (52), we would need to use multinomial expansions, and then 
average over the resulting powers of Jr's. For the latter, we can work to lowest order in the 
6ri, which means we only take the first term in Eq. (50). This amounts to replacing every 6ri 
with 1 — f j (and similarly for j and k), and in addition multiplying the whole expression by an 
overall factor of fifjfk. For example, if m = 1 and n = 2, one of the terms in the multinomial 
expansion is KF^-j^J^jJ^j^{5r^5r'j5rf,)ind- This average would yield, using Eq. (50) and considering 
only the lowest order term, rifjfj^{\ — rif{\ — rj)'^{\ — f^)^. 

This procedure also is not quite correct, since terms with only one factor of Jr^, which average 
to zero, are replaced with 1 — fj. This also turns out not to matter; again, we discuss why at 
the end of the section. 

We can, then, go ahead and use the above "replace blindly" algorithm. Note that the factors 
of 1 — fj, 1 — fj and 1 — ffc turn Jij and fCijk into normalized correlation coefficients (see Eq. (32)), 
which considerably simplifies our equations. Using also Eq. (38) for Pijk, Eq. (52) becomes 

fe(r)"^e.(r)")S = E ^^^M~^^,kr{~P%kT ■ (53) 

i<j<k 

We can now combine Eqs. (51) and (53), and insert them into Eq. (48). This gives us the 
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first two terms in the perturbative expansion of (Cp(r)"^Cg(r)")md; the result is written down in 
Eq. (37) above. 

Why can we ignore the overcounting associated with terms in which an index appears exactly 
zero or one times? We clearly can't do this in general, because for such terms, replacing 6ri 
with 1 — fj fails - either because the terms didn't exist in the first place (when one of the indices 
never appeared) or because they averaged to zero (when an index appeared exactly once). In 
our case, however, such terms do not appear in the Taylor expansion. To see why, note first 
of all that, because of the form of f{x,y), its Taylor expansion can be written (x — y)'^f{x,y) 
where f{x,y) is finite at x = y (see Eq. (35)). Consequently, the expression inside the sum over 
i,j and k in Eq. (52) should really contain a multiplicative factor that arises from (^p — ^q)'^, 
and thus has the form 

As we saw in the previous section, we are interested in the third order term only to compute 
DKL{ptrue\\Ppair), for which J7^^ = J^j. Therefore, the above multiplicative factor reduces to 
(^fjfc ~ ^Ijkf' [Sfi^''' j^'''kf' ■ It is that last factor of (SriSrjdrk)'^ that is important, since it 
guarantees that every term in the Taylor expansion will have all indices appearing at least 
twice. Therefore, although Eq. (52) is not true in general, it is valid for our analysis. 

We end this section by pointing out that there is a very simple procedure for computing 
averages to second order in 6. Consider a function (j){S,pi(,q) such that (j){S,pjS,q) its first 
derivatives vanish at £,p = £,q = 0. Then, based on the above analysis, we have 

mp,^q)hnci = Y.r.rMJt,^J'j) + 0{{N6f) . (54) 

i<j 

Two easy corollaries of this are: for k and I positive integers, 

{Sr^m^^,))^nd = Y,nf,4>{jf^,jf^) + O {N^6^) (55a) 

{drfdr'jcPiCp, U))ind = nf,4>{jf^,J^^) + O (iVJ^) (55b) 

where the sum in Eq. (55a) run over j only, and we used the fact that both Jf- and J^^ are 
symmetric with respect to the interchange of i and j. 
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5.3 Generating synthetic data 

As can be seen in Eq. (10), synthetic data depends on three sets of parameters: /i*™*^, J/J"*^, and 
K^jJ^^. Here we describe how they were generated. 

Our first step was to generate the /i*™^. To do that, we chose a vector r* = 
(where, recah, A^* = 15 is the number of neurons in our base distribution), from an exponential 
distribution with mean 0.02. From this we chose the local field according to Eq. (15a), 



In the perturbative regime, a distribution generated with these values of the local fields will have 
firing rates approximately equal to the r?; see Eq. (15a) and Fig. 6. We then draw J^J"*^ and 
^tjk^ from Gaussian distributions with means equal to 0.05 and 0.02 and standard deviations 
of 0.8 and 0.5, respectively. Using non-zero values for /Cijfc, means that the distribution is not 
pairwise. 

5.4 Bin size and the correlation coefficients 

One of our main claims is that A^v is linear in bin size, 6t. This is true, however, only if gi^d 
and Qpair are independent of 5t, as can be seen from Eq. (3). In this section we show that 
independence is satisfied if 5t is smaller than the typical correlation time of the responses. For 
5t larger than such correlation times, gind and Qpair do depend on 5t, and Ajv is no longer linear 
in 5t. Note, though, that the correlation time is always finite, so there will always be a bin size 
below which the linear relationship, A ^ 5t, is guaranteed. 

Examining Eqs. (41) and (43), we see that gind and gpair depend on the normalized correlation 
coefficients, pij, pijk (we drop superscripts, since our discussion will be generic). Thus, to 
understand how g^d and gpair depend on bin size, we need to understand how the normalized 
correlation coefficients depend on bin size. 

We start with the second order correlation coefficient, since it is simplest. The corresponding 
cross-correlogram, which we denote Cjj(T), is given by 





(56) 
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where is the time of the k^^ spike on neuron i (and similarly for t^) and 6{-) is the Dirac 
5-function. The normalization in Eq. (56) is slightly non-standard - more typical is to divide 
by something with units of firing rate (fj, i^j or (t'ji/j)^/^), to give units of spikes/s. The 
normalization we use is convenient, however, because in the limit of large r, Cijir) approaches 
one. 

It is slightly tedious, but otherwise straightforward, to show that when 5t is sufficiently small 
that only one spike can occur in a time bin, pij is related to Cij{T) via 

P^J = ll dr (1 - \T\/5t) (Q,(t) - 1) . (57) 

The (unimportant) factor (1 — |T|/(5t) comes from the fact that the spikes occur at random 
locations within a bin. 

Equation (57) has a simple interpretation: pij is the average height of the central peak of the 
cross-correlogram relative to baseline. How strongly pij depends on 6t is thus determined by the 
shape of the cross-correlogram. If it is smooth, then pij approaches a constant as 5t becomes 
small. If, on the other hand, there is a sharp peak at r = 0, then pij ~ 1/V6t = 1/5 for small 
5t, so long as 5t is larger than the width of the peak. (The factor of z7 included in the scaling is 
approximate; it is a placeholder for an effective firing rate that depends on the indices i and j. 
It is, however, sufficiently accurate for our purposes.) A similar relationship exists between the 
third order correlogram and the correlation coefficient. Thus, pijk is also independent of 5t in 
the small 5t limit, whereas if the central peak is sharp it scales as 1/5^. 

The upshot of this analysis is that the shape of the cross-correlogram has a profound effect 
on the correlation coefficients and, therefore, on A^v. What is the shape in real networks? 
The answer typically depends on the physical distance between cells. If two neurons are close, 
they are likely to receive common input and thus exhibit a narrow central peak in their cross- 
correlogram. If, on the other hand, the neurons are far apart, they are less likely to receive 
common input. In this case, the correlations come from external stimuli, so the central peak 
tends to have a characteristic width given by the temporal correlation time of the stimulus, 
typically 100s of milliseconds. 

Although clearly both kinds of cross-correlograms exist in any single population of neurons, 
it is convenient to analyze them separately. We have already considered networks in which 
the cross-correlograms were broad and perfectly flat, so that the correlation coefficients were 
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strictly independent of bin size. Based on our analysis in Sec. 5.2, we can also consider the 
opposite extreme: networks in which the the cross-correlograms (both second and higher order) 
among nearby neurons exhibit sharp peaks while those among distant neurons are uniformly 
equal to 1. In this regime, the correlation coefficients depend on 6t: as discussed above, the 
second order ones scale as 1/5 and the third as 1/5'^. This means that the arguments of f{pij,0) 
and fipfjk'^,Pijff) are large. From the definition of f{x,y) in Eq. (35), in this regime both are 
approximately linear in their arguments (ignoring log corrections). Consequently, f{pij,0) ~ 1/5 
and f{ptjr,pfjk) ~ l/<52. This implies that 

gind and Qpair scale as N6 and N'^d, respectively, 
and so Ajv ~ -/V, independent of 5. Thus, if the bin size is large compared the the correlation 
time, Aat will be approximately independent of bin size. 

5.5 Assessing goodness of fit for independence across time assumption 

In this section we derive the expression for 7 given in Eq. (22). Our starting point is its definition, 
Eq. (21). It is convenient to define R to be a concatenation of the responses in M time bins, 

R^(ri,r2,...,r*^) (58) 

where, as in Sec. 3, the superscript labels time, so 'P(R) is the full, temporally correlated, 
distribution. 

With this definition, we may write the numerator in Eq. (21) as 

DKL{r{R)\\llppa^rir')) = -5*^^ " Y.T.Ptrueir) log2Ppa^r{r) (59) 
t t r 

where Sjf^n is the entropy of the full distribution, 'P(R), the last sum follows from a marginal- 
ization over all but one element of 'P(R), and Ptmei^) true distribution at time r. Note 

that Ppair{^) is independent of time, since it is computed from a distribution averaged over all 
bins. That distribution, which we have called ptrue(r), is given in terms of p^j.„g(r) as 

t 

Inserting this definition into Eq. (59) eliminates the sum over t, and replaces it with Mptmei^)- 
For simplicity we consider the maximum entropy pairwise model. In this case, because Ppair{^) 
is in the exponential family, and the first and second moments are the same under the true and 
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maximum entropy distributions, we can replace ptrue{^) with Pmaxent{^)- Consequently, Eq. (59) 
becomes 



DKL{v{n)\\J[ppair{y')) = MSmaxent " S, 



'trite ■ 



t 



This gives us the numerator in the expression for 7 (Eq. (21)). The denominator, D k L{ptrue\\Pind) 
is equal to Sind — Strue (see Eq. (4)). This leads to 

^ M{Smaxent ~ Strue) ^ MStrue ~ Spf^^ (60) 

M{Sind ~ Strue) M[Sind ~ Strue) 

where we added and subtracted MStrue to the numerator. 

The first term on the right hand side of Eq. (60) we recognize, from Eqs. (3), (4) and (5), 
as Aat. To cast the second into a reasonable form, we define Sf^^ to be the entropy of the 
distribution that retains the temporal correlations within each neuron but is independent across 
neurons. Then, adding and subtracting this quantity to the numerator in Eq. (60), and also 
adding and subtracting MSmd, we have 

^ _ _|_ jSj^id ~ Sjfue) ~ M{Sind — Strue) + {MSjnd — Sf^^ ^^^^ 

M^{Sind ~ Strue) 

The key observation is that if M6 ^ 1, then 

S^d - S^ue = 9^ndN{N - 1){M6)^ . 

Comparing this with Eq. (7a), we see that S^^^ — Strue is a factor of times larger than 
Sind - Strue- We thus have 

7 = A^ + (M-l)+ ^f"'^"5'^ (62) 

{>^ind '-'true ) 

Again assuming M6 <C 1, and defining h{x) = — xlog2 x — {1 — x) log2(l — x), the last term in 
this expression may be written 

MSind - = mY, Kn) - Yl ^(^^^^) -MY,r^ log M = A^^ M log^ M . (63) 

i i i 

Inserting this into Eq. (62) and using Eq. (7a) yields Eq. (22). 

We have assumed here that M6 <^ 1; what happens when Ad 6 ~ 1, or larger? To answer 
this, we rewrite Eq. (60) as 

Smaxent ~ Strue/ /ca\ 

7 = X • 

38 



We argue that in general, as M increases, Sij.^^lM becomes increasingly different from S^axent^ 
since the former was derived under the assumption that the responses at different time bins were 
independent. Thus, Eq. (22) should be considered a lower bound on 7. 
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